00010			CHAPTER SIX
00100			MODEL VALIDATION
00200	(In collaboration with Franklin Dennis Hilf)
00300	
00500	
00600	
00700		There are several meanings of the term "validate", which
00800	derives from the Latin VALIDUS = strong. Thus to validate X means to
00900	strengthen it. In science it usually means to strengthen X's
01000	acceptability as a hypothesis, theory, or model. Lurking in the
01100	background there is usually some concept of truth or authenticity.
01300		In a purely instrumentalist view, theories are simply
01400	calculating or predicting devices for human convenience. They do not
01500	explain, and it is unjustified to apply the terms of truth or falsity
01600	to them. Under a realist view one seeks explanatory truth,
01700	that which really is the case, and hence proposed theories must be
01800	evaluated for their authenticity. Since absolute truth cannot be attained,
01900	we must settle for degrees of approximation.
02000	To validate, then, is to carry out procedures
02100	which show to what degree X, or its consequences, correspond with
02200	facts of observation. We compare the model with its natural counterpart.
02210	Discrepancies in the comparison reveal what is not understood and must
02220	be modified in the model; such failures should be constructive, yielding
02230	new information. After modifications are made, a fresh comparison is made
02240	with the natural counterpart, and we cycle repeatedly through this procedure, attempting to gain convergence.
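	As a sketch only, this cyclical procedure can be summarized in
present-day programming notation (Python). The functions compare, modify,
and acceptable are hypothetical placeholders, not parts of any program
described in this book:

    def validate(model, observations, compare, modify, acceptable):
        # Compare the model with its natural counterpart, let the
        # discrepancies guide modification, and repeat, attempting
        # to converge on an acceptable degree of approximation.
        while True:
            discrepancies = compare(model, observations)
            if acceptable(discrepancies):
                return model
            model = modify(model, discrepancies)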
02400	
02500		Once a simulation model reaches a stage of intuitive
02600	adequacy, a model builder should consider using more stringent
02700	evaluation procedures relevant to the model's purposes. For example,
02800	if the model is to serve as a training device, then a simple
02900	evaluation of its pedagogic effectiveness would be sufficient. But
03000	when the model is proposed as an explanation of a psychological
03100	process, more is demanded of the evaluation procedure. In the area of
03200	simulation models, Turing's test has often been suggested as a validation procedure.
03300		It is very easy to become confused about Turing's test. In
03400	part this is due to Turing himself, who introduced the now-famous
03500	imitation game in a paper entitled COMPUTING MACHINERY AND
03600	INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
03700	there are actually two imitation games, the second of which is
03800	commonly called Turing's test.
03900		In the first imitation game, two groups of judges try to
04000	determine which of two interviewees is a woman. Communication between
04100	judge and interviewee is by teletype. Each judge is initially
04200	informed that one of the interviewees is a woman and one a man who
04300	will pretend to be a woman. After the interview, the judge is asked
04400	what we shall call the woman-question, i.e. which interviewee was the
04500	woman? Turing does not say what else the judge is told, but one
04600	assumes the judge is NOT told that a computer is involved, nor is he
04700	asked to determine which interviewee is human and which is the
04800	computer. Thus, the first group of judges would interview two
04900	interviewees: a woman, and a man pretending to be a woman.
05000		The second group of judges would be given the same initial
05100	instructions, but unbeknownst to them, the two interviewees would be
05200	a woman and a computer programmed to imitate a woman. Both groups
05300	of judges play this game until sufficient statistical data are
05400	collected to show how often the right identification is made. The
05500	crucial question then is: do the judges decide wrongly AS OFTEN when
05600	the game is played with a man and a woman as when it is played with a
05700	computer substituted for the man? If so, then the program is
05800	considered to have succeeded in imitating a woman as well as a man
05900	imitating a woman. For emphasis we repeat: in asking the
06000	woman-question in this game, judges are not required to identify
06100	which interviewee is human and which is machine.
06200		Later on in his paper Turing proposes a variation of the
06300	first game. In the second game, one interviewee is a man and one is a
06400	computer. The judge is asked to determine which is the man and which
06500	is the machine; we shall call this the machine-question. It is this
06600	version of the game which is commonly thought of as Turing's test. It
06700	has often been suggested as a means of validating computer simulations
06800	of psychological processes.
06900		In the course of testing a simulation (PARRY) of paranoid
07000	linguistic behavior in a psychiatric interview, we conducted a number
07100	of Turing-like indistinguishability tests (Colby, Hilf, Weber and
07200	Kraemer, 1972). We say `Turing-like' because none of them consisted of
07300	playing the two games described above. We chose not to play these
07400	games for a number of reasons, which can be summarized by saying that
07500	they do not meet modern criteria for good experimental design. In
07600	designing our tests we were primarily interested in learning more
07700	about developing the model. We did not believe the simple
07800	machine-question to be a useful one in serving the purpose of
07900	progressively increasing the credibility of the model, but we
08000	investigated a variation of it to satisfy the curiosity of colleagues
08100	in artificial intelligence.
08200		In this design, eight psychiatrists interviewed by teletype
08300	two patients using the technique of machine-mediated interviewing,
08400	which involves what we term "non-nonverbal" communication since
08500	non-verbal cues are made impossible (Hilf, 1972). Each judge
08600	interviewed two patients, one being PARRY and one being a hospitalized
08700	paranoid patient. The interviewers were not informed that a
08800	simulation was involved, nor were they asked to identify which was the
08900	machine. Their task was to conduct a diagnostic psychiatric interview
09000	and rate each response from the `patients' along a 0-9 scale of
09100	paranoidness, 0 meaning none and 9 the highest degree. Transcripts of
09200	these interviews, without the ratings of the interviewers, were then
09300	utilized for various experiments in which randomly selected expert
09400	judges conducted evaluations of the interview transcripts. For
09500	example, in one experiment it was found that patients and model were
09600	indistinguishable along the dimension of paranoidness.
09610	(Elaborate from ttt paper here  giving interviews, data, tables etc.)
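	For illustration only, the data gathered by such an interview
might take the following shape (a Python sketch; the exchanges and
ratings below are invented, not drawn from our transcripts):

    # One machine-mediated interview: each judge question is paired
    # with the `patient' response and the interviewer's 0-9 rating
    # of paranoidness for that response.
    interview = [
        {"question": "WHY ARE YOU IN THE HOSPITAL?",
         "response": "I SHOULDN'T BE HERE.",
         "paranoidness": 3},
        {"question": "WHO BROUGHT YOU HERE?",
         "response": "THE POLICE.",
         "paranoidness": 5},
    ]
    # Mean paranoidness rating over the whole interview.
    mean_rating = (sum(turn["paranoidness"] for turn in interview)
                   / len(interview))
    print(mean_rating)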
09700		To ask the machine-question, we sent interview transcripts,
09800	one with a patient and one with PARRY, to 100 psychiatrists randomly
09900	selected from the Directory of American Specialists and the Directory
10000	of the American Psychiatric Association. Of the 41 replies, 21 (51%)
10100	made the correct identification while 20 (49%) were wrong. Based on
10200	this random sample of 41 psychiatrists, the 95% confidence interval
10300	is between 35.9% and 66.5%, a range which is close to chance. (Our
10400	statistical consultant was Dr. Helena C. Kraemer, Research
10500	Associate in Biostatistics, Department of Psychiatry, Stanford
10600	University.)
10700		Psychiatrists are considered expert judges of patient
10800	interview behavior, but they are unfamiliar with computers. Hence we
10900	conducted the same test with 100 computer scientists randomly
11000	selected from the membership list of the Association for Computing
11100	Machinery (ACM). Of the 67 replies, 32 (48%) were right and 35 (52%)
11200	were wrong. Based on this random sample of 67 computer scientists, the
11300	95% confidence interval ranges from 36% to 60%, again close to a chance level.
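	The intervals above are consistent with the usual normal
approximation to a binomial proportion. The following Python sketch
reproduces them under that assumption (the text does not record the
exact method our statistical consultant used, and binomial_ci is our
illustrative helper name):

    from math import sqrt

    def binomial_ci(successes, n, z=1.96):
        # Two-sided 95% confidence interval for a proportion,
        # normal approximation, returned in percent.
        p = successes / n
        half = z * sqrt(p * (1 - p) / n)
        return 100 * (p - half), 100 * (p + half)

    print(binomial_ci(21, 41))  # psychiatrists: about 35.9 to 66.5
    print(binomial_ci(32, 67))  # computer scientists: about 35.8 to 59.7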
11400		Thus the answer to this machine-question, "can expert judges,
11500	psychiatrists and computer scientists, using teletyped transcripts
11600	of psychiatric interviews, distinguish between paranoid patients and
11700	a simulation of paranoid processes?", is "no". But what do we learn
11800	from this? It is some comfort that the answer was not "yes" and the
11900	null hypothesis (no differences) failed to be rejected, especially
12000	since statistical tests are somewhat biased in favor of rejecting the
12100	null hypothesis (Meehl, 1967). Yet this answer does not tell us what
12200	we would most like to know, i.e. how to improve the model.
12300	Simulation models do not spring forth in a complete, perfect and
12400	final form; they must be gradually developed over time. Perhaps we
12500	might  obtain  a "yes" answer to the machine-question if we allowed a
12600	large number of expert judges to conduct  the  interviews  themselves
12700	rather than studying transcripts of other interviewers. Such a result would
12800	indicate that the model must be improved, but unless we systematically
12900	investigated how the judges succeeded in making the discrimination, we
13000	would not know what aspects of the model to work on. The logistics of
13100	such a design are immense, and obtaining a large N of judges for sound
13200	statistical inference would require an effort disproportionate to the
13300	information-yield.
13400		A more efficient and informative way to use Turing-like tests
13500	is to ask judges to make ordinal ratings along scaled dimensions from
13600	teletyped interviews. We shall term this approach asking the
13700	dimension-question. One can then compare the scaled ratings received by
13800	the patients and by the model to determine precisely where and by how
13900	much they differ. Model builders strive for a model which
14000	shows indistinguishability along some dimensions and
14100	distinguishability along others. That is, the model converges on what
14200	it is supposed to simulate and diverges from that which it is not.
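	As an illustration of how such comparisons can be made, the
following Python sketch computes a two-sample t statistic between
ratings of model and patient along a single dimension; the ratings
shown are invented for illustration only:

    from math import sqrt
    from statistics import mean, variance

    def pooled_t(a, b):
        # Two-sample t statistic with pooled variance for
        # independent groups of ordinal ratings.
        na, nb = len(a), len(b)
        sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
        return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

    model_ratings = [4, 5, 3, 6, 4, 5, 5, 4]    # hypothetical 0-9 ratings
    patient_ratings = [2, 1, 3, 2, 2, 1, 3, 2]  # hypothetical 0-9 ratings
    print(pooled_t(model_ratings, patient_ratings))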
14300		We mailed paired-interview transcripts to another 400
14400	randomly selected psychiatrists, asking them to rate the responses of
14500	the two `patients' along certain dimensions. The judges were divided
14600	into groups, each judge being asked to rate the responses of each I-O
14700	(input-output) pair in the interviews along four dimensions. The total
14800	number of dimensions in this test was twelve: linguistic noncomprehension,
14900	thought disorder, organic brain syndrome, bizarreness, anger, fear,
15000	ideas of reference, delusions, mistrust, depression, suspiciousness
15100	and mania. These are dimensions which psychiatrists commonly use in
15200	evaluating patients.
15300		Table 1 shows there were significant differences, with PARRY
15400	receiving higher scores along the dimensions of linguistic
15500	noncomprehension, thought disorder, bizarreness, anger, mistrust and
15600	suspiciousness. On the dimension of delusions the patients were rated
15700	significantly higher. There were no significant differences along the
15800	dimensions of organic brain syndrome, fear, ideas of reference,
15900	depression and mania.
16000		While tests asking the machine-question indicate
16100	indistinguishability at the gross level, a study of the finer
16200	structure of the model's behavior through ratings along scaled
16300	dimensions shows statistically significant differences between
16400	patients and model. These differences help the model
16500	builder by suggesting which aspects of the model must be modified and
16600	improved for it to be considered an adequate simulation of the
16700	class of paranoid patients it is intended to simulate. For example,
16800	it is clear that PARRY's language-comprehension must be improved.
16900	Once this has been implemented, a future test will tell us whether
17000	improvement has occurred and by how much in comparison to the earlier
17100	version. Successive identification of particular areas of failure
17200	in the model permits their improvement and the development of
17300	more adequate model-versions.
17400		Further evidence that the machine-question is too coarse and
17500	insensitive a test comes from the following experiment. In this test
17600	we constructed a random version of the paranoid model which utilized
17700	PARRY's output statements but expressed them randomly, no matter what
17800	the interviewer said. Two psychiatrists conducted interviews with
17900	this model, transcripts of which were paired with patient interviews
18000	and sent to 200 randomly selected psychiatrists, asking both the
18100	machine-question and the dimension-question. Of the 69 replies, 34
18200	(49%) were right and 35 (51%) wrong. Based on this random sample of
18300	69 psychiatrists, the 95% confidence interval ranges from 39% to 63%,
18400	again indicating a chance level. However, as shown in Table 2,
18500	significant differences appear along the dimensions of linguistic
18600	noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
18700	rated higher. On these particular dimensions we can construct a
18800	continuum in which the random version represents one extreme and the
18900	actual patients the other. Our (nonrandom) PARRY lies somewhere between
19000	these two extremes, indicating that it performs significantly better
19100	than the random version but still requires improvement before becoming
19200	indistinguishable from patients (see Fig. 1). Table 3 presents t
19300	values for differences between mean ratings of PARRY and
19400	RANDOM-PARRY. (See Table 2 and Fig. 1 for the mean ratings.)
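	Assuming the same normal approximation as before, the earlier
binomial_ci sketch gives for this sample

    print(binomial_ci(34, 69))  # about 37.5 to 61.1

which is in rough agreement with the reported 39% to 63%; any small
difference would reflect rounding or the exact method used.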
19500		Thus it can be seen that  such  a  multidimensional  analysis
19600	provides  yardsticks  for measuring the adequacy of this or any other
19700	dialogue simulation model along the relevant dimensions.
19800		We conclude that when model builders want to conduct tests
19900	of adequacy which indicate in which direction progress lies, and to obtain a
20000	measure of whether progress is being achieved, the way to use
20100	Turing-like tests is to ask expert judges to make ratings along
20200	multiple dimensions that are essential to the model. A good validation
20210	procedure has criteria for better or worse approximations. Useful tests do
20300	not prove a model; they probe it for its strengths and weaknesses and
20310	clarify what is to be done next in modifying and repairing the model.
20400	Simply asking the machine-question yields little information relevant
20500	to what the model builder most wants to know, namely, along what
20600	dimensions must the model be improved.
20700	
20800